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Abstract 

Second language (L2) reading comprehension assessment has long relied upon 
classical quantitative, product-oriented measurement techniques (i.e., multiple- 
choice and cloze) in both research and classroom assessment. As Bernhardt 
(1991) clearly demonstrated, these traditionally employed assessment methods are 
unable to capture the complex processes that take place between learner and text. 
The present paper has as its central purpose to enhance and extend the efficiency, 
consistency, and validity of an alternative measure, the immediate free recall 
protocol. Unlike the multiple choice or cloze tests, the recall protocol is a truly 
integrative authentic-task measure, firmly grounded on a constructivist model of 
reading comprehension. Given an understanding of the model, the literature base 
on memorial representation, and the formulation of a "weak-rule" scoring system, 
the present study demonstrates a computerized recall protocol scoring system that 
has high correlation with traditional manual scoring methods. The results further 
demonstrate that the computerized procedure provides efficiency in delivery and 
scoring, enhances consistency, is practical for large-scale assessment, and can 
lead to improved diagnostic and placement testing. Using this system as part of a 
multiple-measures approach, valid and reliable quantitative score information is 
readily available and directly linked to a qualitative database ripe for additional 
examination to advance L2 reading comprehension research and model 
development. 

Keywords: reading comprehension, assessment, recall protocol, computer based 
testing 


Introduction 

Significant attention in the second language (L2) reading comprehension assessment literature 
revolves around the development of a dynamic, learner-based, constructivist model of reading 
comprehension that demands new and equally dynamic paradigms of assessment (Alderson, 
2000; Berkemeyer, 1989; Bernhardt 1983a, 1985, 1986b, 1990, 1991, 2000; Bernhardt and 
Berkemeyer, 1988; Bernhardt andDeVille, 1991; Brisbois, 1992; Maarof, 1998). L2 reading 
comprehension assessment, however, has long relied upon classical quantitative, product- 
oriented measurement techniques (i.e., multiple-choice and cloze) in both research and classroom 
assessment. As regards reading comprehension, much has been written about the shortcomings 
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of traditional assessment such as multiple choice and cloze tests (see for example, Alderson, 
2000; Maarof, 1998) so prevalent in education today. These product-oriented assessment 
methods are traditionally employed to gain insight into the learners' L2 reading comprehension 
abilities, but as Bernhardt (1991) clearly demonstrated, this sort of assessment is not enough 
because it is unable to capture the complex processes that take place between learner and text. In 
order to remedy this problem, a new and welcome trend in reading comprehension assessment 
emerged that looked "more carefully at the authenticity of the assessment tasks and their 
alignment with current research, theory, and instructional practices" (Valencia, 1990: 60). 

The present paper, firmly grounded in a constructivist view of L2 reading comprehension, 
embraces and builds upon the considerable body of research and extensive database developed 
primarily by Bernhardt and her many colleagues. In fact, it is the central purpose of this paper to 
provide a pathway to ultimately enhance and extend the efficiency, consistency, and validity of 
the primary measure used in many of the studies cited above, the immediate free recall protocol. 
This paper clearly demonstrates the beginnings of an efficient, automated recall protocol data 
collection and scoring system for large-scale studies that, when used as a part of a multiple- 
measures approach, significantly enhances quantitative and, more importantly, qualitative data 
collection to further inform research and instruction. The richness of the data may ultimately 
serve to provide greater insight into that which is currently unexplained or unknown, and thus 
help extend our understanding of the complexities of L2 reading comprehension development 
and subsequently further the development of existing and future research models. 

It is beyond the scope of this article, however, to explicate detailed aspects of traditional recall 
protocol development, administration, and analysis using the weighted pausal unit scoring 
system (Johnson, 1970). A significant body of literature already exists to inform the unfamiliar 
reader with this measure (Berkemeyer, 1989, Bernhardt, 1991; Brisbois, 1992; Maarof, 1998, 
among others). Rather, this paper describes the development, testing, and validity of an 
automated recall assessment procedure designed to mirror the already widely used and accepted 
manual procedure and to facilitate efficient scoring and data analysis. 


The recall protocol procedure in L2 reading assessment 

Operating under a learner-based theory of reading comprehension, Bernhardt (1990, 1991) 
demonstrated the inability of traditional, quantitative reading comprehension assessment to 
examine the complex and dynamic processes involved in L2 reading comprehension. She found 
that "if a test is to adequately assess L2 reading ability it must acknowledge the status of the 
reader's knowledge base," and "a successful assessment mechanism must be integrative in 
nature" (Bernhardt, 1991: 193). 

As a research-based, appropriate and suitable alternative, the recall protocol procedure is seen as 
a highly valid and effective L2 reading comprehension assessment measure that provides both 
qualitative and quantitative information (Berkemeyer, 1989; Bernhardt, 1983a, 1985, 1991; 
Bernhardt and Berkemeyer, 1988; Brisbois, 1992; Lee, 1986). According to Berkemeyer (1989: 
131): 
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It does not allow students to guess their way through the text. . .nor does it influence 
students' understanding of the text. In short, the immediate recall protocol demands that 
the reader comprehend the text well enough to be able to recall it in a coherent and 
logical manner.... This procedure allows misunderstandings and gaps in comprehension 
to surface; a feature that other methods of evaluation cannot offer. 

Bernhardt and others (e.g., Berkemeyer 1989, 1991; Brisbois 1992; Lee, 1986, Maarof, 1998) 
have long championed the use of the immediate, free-response reading recall protocol as a 
particularly efficacious measure of a learner-based L2 reading comprehension paradigm. That 
paradigm promotes a multifaceted reading comprehension research and assessment approach 
combining quantitative and qualitative methods in order to arrive at a more complete picture of 
the comprehension process. Clearly, the recall protocol procedure can be an important part of 
that multifaceted approach. As Brisbois (1992: 168) noted: 

. . .testing methods, such as the recall protocol procedure, need to gain wider 
acceptance as a measure of reading comprehension. Not only is the recall 
protocol more sensitive than discrete -point tests, but its sensitivity becomes more 
pronounced as reading proficiency increases. 

Recall as an authentic assessment task 

The authenticity of the recall protocol procedure to assess L2 reading comprehension has come 
under question. Schmidt-Rinehart (1994), for example, reports on research that used the recall 
protocol procedure for analysis. She states, "although the advantages of this comprehension 
measure are well documented . . .one would rarely be asked to perform a similar task in real life" 
(Schmidt-Rinehart, 1994: 186). In fact, nothing could be further from the truth as can be argued 
on three levels. First, from a theoretical perspective the recall protocol procedure has been 
successfully used as a noninvasive technique for understanding how persons have put a message 
together in both LI and L2 research (e.g., Bartlett, 1932; Berkemeyer, 1989, 1991; Bernhardt, 
1983a, 1983b, 1985, 1986a, 1986b, 1990; Brisbois, 1992; Craik and Lockhart, 1972; and 
Rumelhart, 1980; Maarof, 1998, among many others). Second, from a practical level the recall 
task is encountered daily. Who has not experienced being asked by an acquaintance if he or she 
had read an article in some newspaper or magazine? The acquaintance then proceeds to relate 
his or her understanding of the article. Upon later inspection of the original piece we often find 
that items were omitted or embellished upon based on the understanding created by the reader. 
Children often cannot wait to relate to their parents stories they read or that were read to them in 
school. The retelling of what someone has read is, in fact, a daily, rather common occurrence- 
one in which human beings engage naturally and readily in the course of conversation. Third, 
for L2 reading comprehension the recall protocol procedure has been shown to be an important 
part of an integrated, multiple measures assessment approach (Berkemeyer, 1989; Bernhardt, 
1983a, 1985, 1990; Bernhardt and Berkemeyer, 1988; Bernhardt and Deville, 1991; Brisbois, 
1992; Lee, 1986; Maarof, 1998). Unlike discrete point measures, it is an integrative task where 
students write down everything they remember about what they read and thus provide a rich 
sample of their individual construction of the text. 

Thus, the present study posits the recall protocol procedure as an authentic-task and clear 
alternative to use in conjunction with more traditional instruments such as the multiple choice 
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and cloze examinations. It is a procedure that has several distinct advantages (Bernhardt, 1983a: 
31-32): 

1) The recall procedure shows where a lack of grammatical skill interferes with the 
student/text communication. 

2) The recall procedure does not influence the readers' understanding of the text. 

3) The procedure stresses the importance of understanding. Students cannot simply 
guess at answers; they must attempt to form an understanding of the text. 

Construct validity: An evolving model 

The recall protocol procedure has construct validity firmly grounded in a reader-based, 
constructivist model of L2 reading comprehension that has continued to evolve. Bernhardt 
(1985, 1986a, 1990) describes an L2 reading comprehension model that is based in part on a 
psycholinguistic model developed by Coady (1979) and to a great extent on analyses of recall 
protocol data. This interactive model attempted to capture the complex comprehension 
processes that take place during the reader/text interaction. Bernhardt (1990: 25) describes the 
various components of the model as being either text based, or extra-text based. The text-based 
components include: 

1. Word recognition (the attachment of a semantic value). 

2. Phonemic/graphemic decoding (recognition of words based on sound or visual 
match). 

3. Syntactic feature recognition (the relationship between words). 

The extra-text based components of the model consist of: 

1. Intratextual perception (the reconciliation of each part of the text with that which 
precedes and succeeds). 

2. Prior knowledge (whether or not the discourse is sensible according to the reader's 
knowledge of the world). 

3. Metacognition (the extent to which the reader thinks about what he or she is reading). 

Using the model, Bernhardt (1985, 1990) was able to reconstruct the processes and mental 
models developed by the individual reader during comprehension. She found that "discovering 
the mental model can be done using the recall protocol procedure and thereby working from the 
students' reconstructions in order to make students actively attend to their process of model 
building" (Bernhardt, 1990: 41). 

Based on an extensive synthesis of recall protocol data, Bernhardt further extended her model to 
account for data that indicated that "problems or inaccuracies in L2 text processing may be 
differentially linked to L2 literacy development" (Bernhardt, 1991: 168). The theoretical model 
she posited attempted to explain the development of L2 reading proficiency based on the 
following assumptions (Bernhardt, 1991: 169): 

1 . Text processing abilities develop over time. 

2. Readers demonstrate the use of different facets of the features of the model over time. 

3. Errors in understanding can reveal development in literacy. 
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4. Assumes commonality in L2 text processing between and among literate learners and 
languages. 

5. No L2 reader would ever be 100% proficient with a 0% error rate nor would an LI 
reader be 0% proficient with a 100% error rate. 

Thus, based on recall protocol evidence, Bernhardt delineated a construct whereby exhibited 
learner errors in the use of the L2 reading comprehension factors and their interrelationship vary 
as proficiency increases. This model also recognizes that metacognition occurs at all levels of 
proficiency and must still be included in L2 comprehension theory, but acknowledges it as a 
characteristic that varies and is highly dependent on the individual. 

Reading comprehension research in the 1990's further re-examined the relationship between LI 
reading ability and the development of L2 reading ability. Alderson (2000) summarizes research 
that investigates or tries to "resolve the question of whether L2 reading is a language problem or 
a reading problem" (Alderson, 2000: 38). A literature survey conducted by Bernhardt and Kamil 
(1995) subscribed a strong 20% of the variability in L2 comprehension test scores to first 
language (LI) literacy. L2 linguistic knowledge accounted for more than 30% of the variability, 
while 50% of the variability in their research of L2 reading comprehension studies remained 
unexplained. Citing additional studies that demonstrated the importance of L2 knowledge, 
Alderson (2000) posited that while this may be so, L2 learners must also cross a '"linguistic 
threshold' that varies by the difficulty of the comprehension task before LI reading ability 
transfers to the L2 reading task" (Alderson, 2000: 39). 

Building on the understanding that LI reading abilities both promote and hinder L2 reading 
comprehension, Bernhardt (2000) recognized that the relationships between factors in her 1991 
model are also affected by the relationship of the "linguistic overlap" between the two languages 
(Bernhardt, 2000: 803). Noting that the features of her 1991 model provide a picture of L2 
reading ability at a moment in time, Bernhardt determined that an articulation of the 
development of comprehension over time was required. Her revised alternative 
conceptualization (Figure 3) stems from the evolution of 1990's reading comprehension research 
and acknowledges the time required to learn and develop L2 reading ability as related to actual 
comprehension. In this model, as comprehension ability grows (Score 1, 2, or 3) general reading 
ability explains 20% of the result, L2 word and syntax knowledge explain approximately 30%, 
and almost 50% of the variability remains unexplained. While appreciating the important role 
that LI reading ability plays in developing L2 ability, the model also recognizes that 
comprehension scores consist of different elements that the learner brings to bear on any given 
task. In addition, having no zero point on the y-axis, the model recognizes that learners will have 
some comprehension ability especially in cognate-rich languages. Finally, and most importantly, 
the model "promotes the consideration of unexplained variance in individual performance and 
after considerable time in instruction" (Bernhardt, 2000: 804). 

It is precisely the need to further our understanding of the multi-faceted and complex processes 
involved in the development of L2 reading comprehension ability and the realization that much 
is yet unexplained or unexplored that necessitates the drive to enhance the effectiveness and 
efficiency of the recall protocol procedure as a part of a multiple measures approach. Alderson 
(2000) correctly notes that "how researchers operationalize their constructs crucially determines 
the results they will gather and thus the conclusions they can draw and the theories they 
develop." (Alderson, 2000: 356). 
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Disadvantages of the recall protocol 

As with any assessment measure, limitations to the recall protocol exist and must be 
acknowledged. Alderson (2000) and Brisbois (1992) point to the major disadvantage of the 
recall protocol procedure, namely that traditional scoring is very time consuming. While 
Bernhardt (1991) notes that scoring can take up to 10 minutes per recall, Brisbois (1992), based 
on her research, found that "in order for this procedure to attain wider use, however, the scoring 
process needs to be rendered less time consuming. Research into the automatization of this 
process would open the way to increased use of this testing method" (Brisbois, 1992: 169). 

Thus, due to the enormous scoring time requirements and subsequent impact on rater consistency 
over time, traditional studies using the recall protocol were necessarily limited to small groups. 
Clearly, a streamlined, automated procedure would greatly enhance the utility of the recall 
protocol procedure. 

Administration of the recall protocol task can also present problems and affect the resulting data. 
Alderson (2000) and Lee (1986) denote objections that the immediate recall protocol may be 
more of a test of memory rather than a measure of comprehension. These objections are 
minimized since in this procedure, the recall typically occurs immediately after reading. Riley 
and Lee (1996) found that the performance on the recall task varied by the instructions given to 
the subjects. The recalls provided by subjects told to summarize the main ideas of the text were 
found to contain significantly more main idea units than the recalls of subjects simply told to 
write down what they could remember. From their research, it is clear that the task may have an 
effect on what is recalled and must be clearly defined. 

Another frequently listed disadvantage of the recall protocol is that of production difficulties in 
the L2 (Maarof, 1998). If subjects were required to produce their written recall in the L2 the 
results may be confounded by their production ability. To avoid this limitation, most studies 
have required subjects to recall in their native language so as not to interfere with their ability to 
demonstrate comprehension (e.g., Bernhardt, 1983a; Bernhardt and Berkemeyer, 1988; Maarof, 
1998). 


Computerizing the recall protocol procedure: A theoretical framework 

As the goal of the present research was to produce a computer-based, valid, and highly efficient 
authentic-task assessment procedure, the theoretical underpinning for automating the recall 
protocol is founded on the extensive body of literature regarding memorial representation (e.g., 
Kintsch, 1974, 1988; Kintsch and van Dijk, 1978; Weaver and Kintsch, 1991). Kintsch (1974) 
developed an expository text analysis system based on "the notion of propositions as the basic 
unit of meaning" (Weaver and Kintsch, 1991: 233). Kintsch (1974: 62) formulated a general 
theory of episodic memory storage, organization, and retrieval centered around three 
assumptions: 

1. Information is stored as sets of phonemic, semantic, and imagery elements. 

2. The basic operation is one of pattern completion. 

3. The contents of short-term memory bias encoding processes both in perception and 
memory. 
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He postulated a two-stage generation-recognition model that includes an encoding specificity 
principle. The encoding specificity principle, previously advanced by Tulving and Thomson 
(1973: 16), states that "a retrieval cue can provide access to information available about an event 
in the memory store if and only if it has been stored as part of the specific memory trace of the 
event." In Kintsch's model, the storage, retrieval, and organization of semantic elements depend 
greatly on pattern matching and completion. The elements of stimulus propositions match up 
with representations stored in memory and result in the retrieval of stored information. 

Much of Kintsch's work on retrieval from memory stems from previous list-learning research. In 
the case of list retrieval, subjects retrieve from memory relatively unrelated words based on their 
individual ability to develop a retrieval scheme while learning the words. Kintsch (1974: 259) 
points out that textual retrieval is a process whereby "text bases are structured lists of 
propositions, with rich interconnections . . .the retrieval cue, usually the title of the story or the 
like, shares a sufficient number of elements with the episodic memory representation of the text." 

Kintsch also notes research that proposes a multilevel perspective of memory for text. Research 
indicates, for example, that text memorized verbatim is subject to more rapid forgetting than is 
text stored for meaning. Items memorized verbatim appear to be stored in short-term memory 
and thus lack linkage with existing memory nodes. Kintsch (1974) demonstrates that the 
verbatim memory level consists of perceptual-linguistic information that is subject to rapid 
forgetting. The deeper-, or propositional-level representations in memory, having been linked to 
pre-existing propositions, remain available for retrieval far longer than the surface level 
structures. Thus, a hierarchical structure of memory is postulated, which Kintsch and van Dijk 
(1978) later divide into macro- and micropropositions. 

Weaver and Kintsch (1991: 233) define macropropositions as "propositions that contain only 
top-level 'gist' information." They found that micropropositions are directly derived from the 
text and "refer to the smallest definable text units, and completely represent the microstructure of 
the text" (Weaver and Kintsch, 1991: 233). Macropropositions, in contrast, do not contain 
detailed text information, but only surface-level detail. A hierarchical textbase is constructed by 
linking the propositions together, with the important macrostructure information stored at the 
top. It is this top-level information that is recalled in more detail in long-term experiments 
(Weaver and Kintsch, 1991). Detailed, microstructure information is stored toward the bottom 
of the hierarchy, where it is more easily forgotten. 

Guindon and Kintsch (1984) further demonstrate the importance of macropropositions in 
memory. Macropropositions have a "priming effect" and form strong memory units. Guindon 
and Kintsch found evidence that "word pairs from the macrostructure prime each other more 
strongly than word pairs from regular sentences, and within- sentence priming is greater than 
between- sentence priming" (Guindon and Kintsch, 1984: 516). So strong is the effect of 
macropropositions, that readers form them whether "they are stated explicitly in the text or not, 
and whether subjects are asked to do so or not" (Guindon and Kintsch, 1984: 517). 

Continued research by Kintsch (1988) led to the formulation of the construction-integration 
model that "describes how texts are represented in memory in the process of understanding and 
how they are integrated into the comprehender's knowledge base" (Kintsch, Welsch, 

Schmalhofer, and Zimny, 1990: 136). The model postulates weak (dumb) rules that simulate 
comprehension as a production system. 
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Kintsch et al. (1990) found that "dumb" rules that do not always work are superior to so-called 
"smart" rules that attempt always to arrive at the correct interpretation of a text. According to 
Kintsch et al. (1990: 136): 

Weak rules. . .do not generate acceptable representations of the text. Irrelevant or 
contradictory items that have been generated by weak rules, however, can be 
eliminated, if we consider not just the items generated by the rules, but also the 
pattern of interrelationships among them. 

They further found that irrelevant items in a text will only be related to one or to a few items, and 
that contradictory items will be negatively connected to other items. As Kintsch et al. (1990: 
136) note: 

Relevant items, on the other hand, will tend to be strongly interrelated— be it 
because they are derived from the same phrase in the text, or because they are 
close together in the textbase, or because they are related semantically or 
experientially in the comprehender's knowledge base. 

In this model, test sentences or words are compared to the entire text (knowledge) base. Using 
complex activation vector analysis techniques, Kintsch et al. (1990) found that test sentences 
that were highly related to original text bases had strong activation vectors, but if there were no 
connection at all, no activation vector existed. In their words, "the more similar it is to the 
original, the more connections there will be, and the more highly activated the test sentence will 
become" (Kintsch et al., 1990: 137). 

Three components underpin the Kintsch (1988) construction-integration model: a) recognition 
based on list-learning research, b) the hierarchical representation of text, and c) the processing 
mechanism of the construction-integration model. The present research utilized elements of the 
Kintsch construction-integration model to develop an efficient, objective, and reliable L2 recall 
protocol scoring procedure. 


Computerized recall protocol text analysis 

Traditional computer text analysis procedures have proven cumbersome and tentative at best, 
with none being able to analyze texts accurately and completely. Artificial intelligence (Al) 
techniques only now coming into existence may eventually hold the key to true natural language 
processing. Unlike traditional text processing requirements, however, whereby computers 
process unknown texts, the present research was intended to analyze students' recall of a known 
textbase. Intuitively, this procedure requires a matching of the propositions recalled with the 
propositions in the original textbase. This is easier said than done as languages are rich with 
variation, and any proposition can be expressed in a number of ways. For efficient, simple 
processing to occur, algorithms must be found to match recall text with the original textbase as 
closely as possible and give credence to the strength of the relationship. 

Perhaps the development of several computer-coded "weak" rules that tap into the fact that a 
person's recall directly reflects and is related to what has been comprehended will allow for an 
accurate picture of his or her level of comprehension. In line with Kintsch et al. (1990), these 
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"dumb" rules may not always reflect accurately what was comprehended but, by examining the 
patterns of interrelationship, the computer may be able to eliminate, or at least limit the 
inconsistencies. It may be possible that the use of computerized "weak" rules may result in an 
innovative algorithm to produce automated recall protocol scores that closely correlate with 
recalls scored by human raters. 

The state of available technology 

Rapid technological advancements make possible expeditious analyses of large amounts of 
numerical and textual data. Off-the-shelf technology was used in the present study to develop a 
practical and efficient scoring procedure by combining the power of the computer and the recall 
protocol. Integrated computer systems using high-speed processors and mass storage devices as 
described by Baker (1984) and Nitko and Hsu (1984) are now readily accessible. Computer 
concordancing software has revolutionized literary research (Pfaffenberger, 1988; Tang, 1985). 
What was once a lifetime task can now be accomplished in minutes. As a result, concordancing 
moved beyond the field of literary criticism and rapidly became a valuable research tool in fields 
such as political science (Pfaffenberger, 1988) and could also prove useful in qualitative recall 
protocol analysis. 

In addition, state-of-the-art relational databases, spreadsheets, and interactive software authoring 
systems give the researcher the resources necessary to create complex programs and simulations. 
Relational databases cross-link bits of related data stored in separate data files for processing as a 
combined entity. Spreadsheets allow users to make alterations to quantitative data, perform 
complex calculations, and view and evaluate the results instantaneously. Interactive software 
authoring systems give even novice users the power to create complex programs and to 
manipulate electronic information between different data processing software packages. 

Clearly, the advances in software technology make possible sophisticated textual analyses. 
Programs can be developed to carry out weighted pausal unit analysis as described by Johnson 
(1970), or explore the feasibility of the Kintsch text analysis system (Kintsch, 1974; Weaver and 
Kintsch, 1991). An integrated computer assessment system, based on the recall protocol, has the 
potential to provide researchers, teachers, and policy makers with an efficient, valuable, and 
powerful assessment tool. Innovative recall data processing algorithms and procedures, based on 
text analysis research (e.g., Johnson, 1970; Kintsch, 1974), need to be developed, integrated, 
demonstrated, and evaluated. In doing so, the present research will help answer the call to 
"embark on a large-scale effort directed toward the development of practical methods of text 
analysis that allow nonexperts to do with relatively little effort what only experts can do today 
with considerable effort" (Weaver and Kintsch, 1991: 242). 

Development of computer weak-rule scoring algorithms 

The researcher examined the use of a weighted means of calculating recall protocol scores as 
outlined in Bernhardt (1991) and Maarof (1998). In addition, various text processing and 
analysis perspectives (e.g., Kintsch, 1974; Kintsch and van Dijk, 1978) were examined, used, 
combined, and/or modified. 

The first step was to determine if an interrelationship could be detected between the text of a 
student's recall and the propositions of the textbase as presented by Guindon and Kintsch (1984), 
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Kintsch (1988), and Kintsch et al. (1990). Several "findings" from Guindon and Kintsch (1984) 
are particularly relevant. These findings state that: 

1. "Words belonging to the same sentential unit should prime each other more than 
words that do not belong to the same unit" (p. 509). 

2. "Words from the same sentence unit, in turn, prime each other more strongly than 
words from different sentences" (p. 514). 

3. "Within-sentence priming is greater than between- sentence priming" (p. 516). 

4. "Words that belong to the macrostructure of a text are recognized faster than 
microstructure words, irrespective of priming effects" (p. 512). 

5. "One content word from a macroproposition provides ready access to another word 
from the same macroproposition" (p. 514). 

After careful examination of recalls from a previous study (Allen, Berry, Bernhardt, and Demel, 
1988), the interrelationships that Kintsch and his colleagues described were indeed found to be 
evident- where recall of a passage was evident, closely related propositions from the original text 
appeared in proximity to each other in the subject recalls. They may not always have been 
worded the same way, but the relationship was unmistakable. For example, consider the 
following sentence recalled in English by a high school German student from an authentic L2 
text on a visit by then President Reagan to Moscow: "The man from Spartanburg South Carolina 
said, 'If I'd wanted to see Russians, I'd have bought a Russian TV.'" Compare this to the original 
sentence, "'When I want to see a Russian, I will buy myself a Russian TV, said a citizen from 
Spartanburg, in the State of North Carolina." Examination of all student recalls revealed similar, 
consistent linkage patterns, the variety and quality of which greatly depended on the depth and 
amount of material recalled. 

Weak rule scoring. Based on an analysis of these relationships, the researcher developed two 
weak-rule scoring systems. To summarize these rules, the first, designated "link-only" scoring, 
was designed to examine the recall word-by-word from the beginning and match these words to 
propositions in the original text. Eink-only scoring captured and credited items in the recalled 
text based on their relationship to the original text. Under these rules, items that were collocated 
in the recall were ranked according to the following priority: a) the proposition represented by 
the current word corresponds +/- 1 to the proposition represented by the following word; b) the 
proposition represented by the current word is in the same sentence as a proposition represented 
by the following word; and c) the proposition represented by the current word is the same 
proposition represented by the following word. The computer link-only procedure, using these 
priorities, attempts to build as many unbroken linkages possible. The propositions contained 
within the linkages with the strongest relationship (i.e., proximity and propositional valuation) 
are then credited as having been remembered. 

A second weak-rule scoring system, "word-only" scoring, was developed to capture certain 
words in the comprehender's textbase that could not be readily linked to other surrounding 
propositions, though their presence indicates that the reader indeed comprehended something 
from the text (i.e., the word did not simply appear by chance). Eor example, a recalled reference 
to the "Rose Bowl" mentioned in an original text about Ronald Reagan's visit to Moscow 
(Bernhardt, 1991) would be credited as having been recalled, even if it occurred in isolation with 
no linkages possible. 
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Computerized recall protocol assessment system 

In order to test these rules and answer the research question, a computerized recall protocol 
assessment software package (Figure 1) was developed by the researcher using state-of-the-art, 
readily available, off-the-shelf technology that can be found on almost any desktop personal 
computer today (e.g., Microsoft Corporation's Word, Excel, and Visual Basic software). The 
first step was to develop a system to deliver the original texts and capture the student recalls. 

This was done using Microsoft Visual Basic on a Windows-based platform. In this system, 
students would read the original L2 texts for as long as they desired. When ready, they clicked 
on a button, the text permanently disappeared and they would then immediately input their recall 
into the computer. Once complete, a mouse click would then bring up the next text and the 
process would repeat until all three texts were delivered and the resulting recalls were captured 
into individual data files. 
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Figure 1: Automated recall protocol assessment system. (Heinz, 1993: 21) 



The second step was to develop the computer scoring system. In this system, Microsoft Excel 
spreadsheets were used to store data such as the generated student scores, the recall rubrics 
containing the propositional scores (weights), and additional data required to facilitate 
processing. The recall rubrics contained in the system mirrored those developed by the human 
raters and used in the manual scoring procedure. 

Text processing. At the heart of the system is the recall data processor and "expert system" that 
contained the recall scoring program itself (Figure 2). The program was designed to assess 
student-generated recalls using the weak-rule scoring systems previously described. These rules 
were carefully converted into software algorithms and repeatedly tested for reliability using 
sample recalls from a previous study. 
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Prior to processing, student-generated recalls must be spell checked to ensure that the computer 
can recognize the words. Due to the limited availability of authoring systems and the level of 
computer expertise of the researcher at the time, spell checking was done one text at a time (with 
currently available software, this process can now be easily automated and integrated). The 
researcher was careful to simply correct misspellings and not change words. In addition, a 
parsing routine, based on algorithms described by Smith (1991), was designed to ensure that 
variations of words (i.e., past tense, plurals, etc.) would be recognized by the computer as being 
equivalent to their root words (i.e., "played" equals "play"; "carries" equals "carry"). 

Figure 2: Recall scoring program. (Heinz, 1993:78) 



Once spell-checked and parsed, the computer program was designed to process the entire set of 
student recalls one at a time through the Recall Data Processor and submit them to the "Link 
Find" module. At this point, the computer examined each student-generated recall, word-for- 
word, and attempted to credit propositional matches between the recall and the original text in 
accordance with the algorithms outlined above. As a part of the process, every student-generated 
word was compared to a list of logical and acceptable synonyms (as determined a priori by a 
panel of trained raters). This procedure mirrors the process that human raters routinely make as 
they score the recalls and attempt to match propositions. Once all possible linkages were 
formed, evaluated for their strength of relationship and credited, the recall was passed to the 
"Word Find" routine that then attempted to make propositional matches based on the remaining 
unmatched unique words. Finally, the computer recorded the matched propositions and 
proceeded to score the next subject's recall. This process continued until all recalls in the study 
were scored. 
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Research question 

Given an understanding of the recall protocol, the literature base on memorial representation, and 
the formulation of a "weak-rule" scoring system, the major research question was to determine if 
there is a significant correlation between recall protocol scores that are manually assessed using a 
pausal unit analysis procedure (Bernhardt, 1991) and computer- generated recall scores using the 
algorithms described. Is it possible, in other words, to use commonly available technology to 
mirror the processes used by human raters to come to an acceptable and equally valid 
quantitative score that might help guide and facilitate qualitative examination of recall data on a 
large number of subjects? 


Research design 

The present study used a series of Pearson product-moment correlation coefficients between 
scores manually generated by trained raters using the pausal unit analysis system and the 
computer- generated recall scores. In addition, the researcher conducted a qualitative analysis on 
the scoring procedure by analyzing how the computer-scored propositions compared with those 
scored by the human raters for each student-generated text. The researcher also collected scoring 
time intervals and analyzed them in order to measure efficiency. 

Independent variables 

The two independent variables consisted of the various methods of recall protocol assessment, 
specifically: a) the manually scored pausal unit analysis procedure (Bernhardt, 1991); and b) the 
automated computer analysis procedure developed in the present study. 

The recall texts used in the present study were administered by the researcher-developed 
computer program. Each subject received three authentic German passages (see Appendix) of 
approximately 200 words in length delivered in random order. Subjects were allowed to read the 
L2 text and then to write in their LI (English) all that they could remember about that text. The 
subjects typed their recalls directly into the computer. As each subject completed his or her 
recall, the computer automatically stored the recall for further processing. 

Dependent variable 

The dependent variable, reading comprehension, was measured using the scores obtained on 
each of the written recall protocols, and the combined mean score obtained on all three. The 
computer instructed the subjects to read each passage as many times as desired and then to write 
everything they could remember about the text, in English. 

Reading passages 

Eor the research, the two authentic German articles and a personal letter used in a previous study 
(Allen et ah, 1988), each approximately 200 words in length, were selected from newspapers and 
nationally read magazines. The articles were then divided into pausal-breath unit propositions 
and leveled based on hierarchical importance (using a scale of 1 to 4) according to the 
procedures outlined by Bernhardt (1991) by native-speaking raters (all raters were German 
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professors at the United States Air Force Academy where the study took place). Manual scoring 
templates (rubrics) were also developed as described in Bernhardt (1991). From these rubrics, 
identical automated scoring masters (spreadsheets) were prepared and checked by the same 
raters. Finally, for the words contained in every proposition, "logical synonyms" (those that 
indeed fit within the context of the story as judged by the raters) were gathered using the 
thesaurus function of Microsoft Word and agreed upon. Other appropriate synonyms, not in the 
thesaurus but deemed acceptable by the raters were also added to the scoring masters (i.e., 

"Soviet leader" as a reference to Soviet President Gorbachev) as deemed appropriate by the 
raters. 

Subjects 

Two hundred and forty students studying German at the United States Air Force Academy in 
Colorado Springs, Colorado were preliminarily selected at random from the first-, second-, and 
third-year course levels (henceforth noted as levels 1, 2, or 3 of ability). The university's 
selection process (students must graduate high school from within the top 10% academically and 
be U.S. citizens) ensured that the subjects were all highly literate LI English speakers. In 
addition, they routinely used computers and were familiar and comfortable with using them. All 
students were required to take a foreign language placement test upon admission to the 
university. Their initial course-level placement, however, was not only based on the placement 
test scores but also previous (high school and personal) experience in the L2. 

Procedures 

Subjects received three practice, paper and pencil recall protocol assessments during their normal 
class periods before data collection took place in order to familiarize them with the procedure. 
Additionally, all subjects received a familiarization session on the computer. The actual 
assessment was conducted during the 15th class meeting after the beginning of the fall term in 
normal class periods. The assessment measure was administered to all 240 subjects, but 100 
were randomly selected (in proportion to the number of subjects at each level of German ability) 
for inclusion in the actual study. This allowed for ample additional data sets should errors or 
computer glitches occur in the processing or capture of data while students were on-line. 

Subjects were not told whether or not they were selected to be included in the final sample. 
Subjects worked independently on individual computers located in a central testing facility and 
the three original L2 texts were presented to the students on the computer screen in a random 
order. Subjects were given as much time as they needed to read the text. As per the program 
design, once they indicated that they were ready to write their recalls, the text disappeared and 
they were prompted to enter their recall into the computer in their native English. This 
procedure was repeated until all three texts were administered. 

Data analysis 

All student recalls were spell checked and then submitted to the parsing routine developed by the 
researcher as previously outlined. Again it must be emphasized that with currently available 
software this process is now easily automated so that students conceivably could accomplish the 
spell check as they complete their recall. 
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Once parsing and spell checking were completed, the recalls were submitted to the scoring 
program and automatically scored. Concurrently, the recalls were manually scored in a 
traditional manner (as specified in Bernhardt, 1991) by the previously trained raters working 
independently. Each rater scored the entire data set for only one of the three texts and worked 
during one, day-long session to score the entire set of 100 recalls. The researcher also scored the 
first 20 randomly selected recalls to establish interrater reliability, and randomly scored 10 of the 
remaining protocols of each text. Interrater reliability averaged 0.98 across all recalls and raters. 
The automated and manually scored recalls were submitted to correlational analysis (Pearson 
product-moment), and the automated recalls were further submitted to item and qualitative 
analyses. 


Table 1: Recall protocol analysis summary of correlations (Manual versus Computer 


Analysis) 


Text 

Correlation 

Batman Article 

0.88 

Bernhardt Letter 

0.89 

Travel Article 

0.87 

Average "T" Scores 

0.94 


Results 

The results of the correlational analysis for the three articles and an overall, combined score 
condition is depicted in Table 1. Consistent with previous research (Bernhardt, 1991), recall 
protocol scores are ultimately combined in order to compensate for varied background/topic 
knowledge. In the present analysis, scores have been standardized through the T-scoring 
method. This method was used to ease comparison since the three recall protocols each had 
different maximum scores. The highly positive correlations for the computer- scored versus 
manually-scored analyses of the three recall protocol texts, and the combined average "T"-scores 
obtained from the data were found to be statistically significant (p<.001, df= 98). 


Table 2: Summary of recall protocol computer analysis-combined scores 



Manual 

Computer 

Pearson product- moment "r": 

N/A 

N/A 

Standard Deviation: 

9.41 

8.93 

Mean Score: 

50.00 

50.00 

Minimum score: 

35.56 

34.97 

Maximum Score: 

75.65 

69.83 

Mean Number of Propositions Recalled: 

13.23 

15.50 


Note: Raw scores converted to T-scores. Sample size = 100. *p < .00 


l,df= 98 


For the combined analysis (Table 2), the correlation between the two scoring procedures was 
0.94, meaning that the computer procedure accounted for approximately 88.3% of the variance. 
In addition. Table 3 shows that overall, both the manual and automated scoring procedures 
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display clear and similar effects by both the mean number of propositions recalled and level of 
instruction. The computer order of recall tracks with previous research (Bernhardt, 1991) with 
level 4 propositions being recalled more often than level 3, followed by level 1, and finally by 
level 2. In contrast, the manual results in the combined condition differed slightly from previous 
research in that, overall, the propositions most frequently recalled were level 4, followed by level 
1, followed by level 3, followed by level 2. 


Table 3: Table of selected means-computer analysis versus manual scoring-combined scores 


By Proposition Level 



Mean Number of Propositions Recalled 

Proposition Level 

Number of Subjects 

Manual 

Computer 

1 

100 

4.12 

3.88 

2 

100 

2.71 

3.46 

3 

100 

4.04 

4.69 

4 

100 

4.54 

5.22 


By Level of Instruction 



Mean Score (T) 

Level 

Number of Subjects 

Manual 

Computer 

1 

69 

45.70 

46.10 

2 

16 

56.00 

55.90 

3 

15 

65.90 

63.80 


As regards level of instruction, both the computer and manually scored procedures also show 
that the advanced (level 3) subjects outscored both the level 2 and level 1 subjects. Figure 3 
graphically depicts the results by level of instruction for both the combined manual and 
automated scoring procedures. In the figure, the similarity of the results between the two 
procedures becomes evident. Thus, both scoring systems provide valid and quantifiable data that 
generate similar information on the ability level of a subject. 
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Figure 3: Distribution of Average Total T-Scores— Combined Manual and Computer Analyses 
(Heinz, 1993: 148). 

Manual Analysis 


14 



Computer Analysis 



Note: Levels 1, 2, 3 correspond to the Ist-year, 2nd-year, and 3rd-year German course levels 
(proficiency)of the subjects. 

One advantage of using the automated system to generate scores clearly becomes evident when 
scoring times are examined. As Table 4 indicates, manual scoring required an average of 2.8 
minutes per recall. In fact, the raters completed their 100 assigned recalls in an average of four 
and one-half hours of dedicated, on-task time. These times do not include rest breaks, lunch, or 
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consultation time (with the researcher), which added almost another three hours for a total time 
of approximately eight hours required to complete the assigned task. 


Table 4: Recall scoring time comparisons (N=100) 



Manual Scoring 

Computer Scoring 

Avg./Recall 

(min.) 

Total 

(hh:mm) 

Avg./Recall 

(min.) 

Total 

(hh:mm) 

Bernhardt Letter 

2.41 

4:00 

0.23 

0:28 

Batman Article 

3.02 

5:02 

0.38 

0:38 

Travel Article 

2.42 

4:02 

0.34 

0:33 

Average Time 

2.62 

4:21 

0.32 

0:33 


Note: Manual rater total scoring time represents time actually spent scoring and does not include breaks, 
lunch, or any time spent between recalls. Computer total scoring time represents processing time spent on 
actual recall computation, and does not include pre-computation data preparation time (i.e., spell check, 
etc.). 


It was also noted that the human raters, though working diligently at the task, required far more 
frequent breaks as the scoring session continued. The raters were asked to remain at the task 
until completely finished. In order to help control for fatigue, raters were allowed to spend as 
much time as they needed to do all 100 recalls. It was felt that although fatigue would be a 
minimal factor, completing the task the same day was better than allowing raters to take the 
protocols home where they could forget how they rated previously, resulting in additional 
inconsistencies in the scoring process. 

What is clearly evident is the overwhelming time saving superiority of the computer scoring 
method. Processed on an early 1990's home computer operating at a clock speed of only 33MHz 
(newer machines now operate in the 2 to 3 GHz range), each recall was scored in an average 
time of 32 seconds, or only 33 minutes for all 100 recalls. Additional time was required for spell 
checking and parsing, as previously noted, but the development of an automated facility now is 
definitely feasible. Once incorporated into the overall protocol analysis system, an automated 
parser would streamline the entire process with a minimal increase in processing time, and spell- 
checking would be accomplished on-line with the subject. 

The bottom line, however, is the undeniable fact that the automated system enjoys a great 
advantage in processing time over traditional manual scoring, especially when the costs 
associated with time and labor are added into the equation. The system described here for the 
first time makes large-scale L2 reading comprehension studies using the recall protocol 
procedure a feasible, time-efficient and cost-effective reality. 

Additional analyses 

Using a standard word processor, the student generated recalls were individually examined. This 
qualitative analysis revealed that "errors" by the computer scoring procedure and those made by 
the human raters could be examined, explained, and accounted for. Computer errors were 
mainly caused by errors in the preparation of the master recall scoring template (incomplete 
synonym lists, incorrect data entry, or inappropriate propositions being credited by the weak-rule 
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scoring systems programmed into the procedure). Overall, the computer scoring rules 
overestimated the propositions recalled, but apparently did so in a consistent manner across all 
recalls. Human errors were found to result from instances where a rater may have misjudged or 
missed a valid scoring opportunity, or may have been caused by rater fatigue, but in general, 
occurred inconsistently . 

The data also indicate that both the manual and computer recall scoring procedures used in the 
present study tapped similar text comprehension indicators that became evident in the qualitative 
examinations. Both procedures showed similar recalled propositional pattern groupings across 
the range of subjects. It was extremely rare, for example, for a person to have comprehended a 
top level (level 4 or 3) proposition and for that proposition not to subsequently have a strong 
relationship with several other propositions in the text. 

Further qualitative analysis also revealed that in comparison to the results of the manually scored 
recalls, the automated recall protocol procedure developed for the present study indeed reflects 
the complexity of L2 reading comprehension. Although by no means perfect, the theory-based 
weak-rule scoring system was able to capture what was comprehended in an organized and 
coherent manner that is analyzable quantitatively and subsequently highlights areas for further 
qualitative analysis. The qualitative analysis is further enhanced by the fact that the data are 
aggregated by the computer and are readily available for additional processing and analysis using 
a variety of software packages (e.g., concordancing), something not easily accomplished with 
hand written recalls. Thus, the information provided by the automated scoring process not only 
serves to provide quantifiable recall scoring, but is also a valid indicator of what is 
comprehended, how it is comprehended, and by whom it is comprehended. 

In the qualitative analysis, the researcher also found evidence to support Cummins' (1979) 
threshold hypothesis, as was confirmatory evidence of Bernhardt's (1991) theoretical distribution 
of reading factors. The L2 learners in the present study, a highly select group attending the 
United States Air Force Academy, typically rank in the top 10% of all high school graduates 
nationally. Thus, they should be highly literate in their LI English. Although they clearly 
possess good LI reading comprehension skills, it is evident from analysis of the recalls that these 
skills did not transfer to the L2, at least not for the beginning students. The advanced-level 
students, however, appear to make use of higher-level processes because their recalls were 
indeed richer and showed evidence of transfer of LI reading comprehension .skills. 


Advantages of the automated recall protocol 

The automated reading recall protocol procedure demonstrated in the present research was found 
to have several distinct advantages. Among them: 

1. It is an authentic, integrative task that is firmly construct referenced. 

2. It is highly efficient and consistent. 

3. It allows for large "N" research and testing. 

4. It greatly reduces scoring subjectivity. 

5. It provides a window into the L2 reading comprehension process. 

6. It is low cost, and uses readily available, state-of-the-art technology. 
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Finally, the procedure addresses Calfee and Hiebert's (1991: 282) six key assessment design 
issues: 

1. It is valid to the extent that it is firmly grounded on a sound reader-based theory of L2 
reading comprehension (Bernhardt, 1991, 2000), uses authentic texts (Bernhardt and 
Berkemeyer, 1988), and requires the subjects to respond in their LI (Alderson, 2000; 
Lee, 1986). 

2. The procedure uses the most direct outward manifestation of the reading 
comprehension process (Lee, 1986), and is thus the most suitable method to obtain 
appropriate and meaningful measures of L2 reading comprehension ability. 

3. The data provided by the procedure can easily be subjected to analysis and are readily 
available to the teacher or researcher. 

4. The procedure has an inherent consistency in scoring, far superior to what human 
raters can achieve. Thus, reliability is significantly enhanced. 

5. The procedure is highly efficient in terms of cost (time, money, and degree of effort 
required to gather the information). 

6. The data collected have been shown to be readily aggregable because a vast database 
of subject-generated information can be collected for analysis. 


Recommendations for further research 

For the automated recall protocol assessment system to become a truly valuable tool for 
educators and researchers, however, the advantages mentioned are clearly not enough. In order 
to be more than just useful, it must also provide insight into the recall process, because "a 
successful assessment mechanism for L2 reading comprehension must provide in-depth 
information on how readers cope with text while, at the same time, providing quantifiable data 
for large-scale comparison and contrast" (Bernhardt, 1991: 194). 

Continuing research using automated recall protocol scoring must take several paths. First, 
although high correlations have been achieved here between manually and computer scored 
recalls, research is needed to improve the correlations. The weak-scoring rule must be refined, 
and this can be accomplished by looking more closely at propositional location in the recall 
versus the original text. Time on both the reading and writing tasks must also be examined as 
potential indicators of comprehension ability. 

The assigning of pausal unit hierarchies needs further exploration. Currently they are determined 
by a team of native (or near native) speaking, trained raters working independently. The analysis 
of recall across many subjects, however, may reveal that L2 learners at various levels "rate" 
propositional importance differently than the native speakers. In addition, like Schmidt-Rinehart 
(1994: 186), during the present analysis, the author found evidence that the weighted four-point 
scoring was equivalent to an unweighted system. To simply throw away hierarchies, however, 
would be to deny that comprehenders unknowingly rate some propositions within a text as being 
more important than others and tend to build comprehension around these structures. While 
additional research is needed, data from the present study indicate that this denial is not valid. 
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Finally, another avenue of exploration is the use of concordancing programs to analyze the 
student-generated data. While the quantitative scores generated by the program may provide 
some insight into the level of ability of the individual subjects, concordancers may provide us the 
vehicle for further qualitative inquiry into L2 reading development to help clarify the 
unexplained and unexplored regions of the Bernhardt (2000) model. We may be able to locate 
verb tenses and determine through correlation whether or not readers are making links between 
and among tense versions in the L2 and their LI reconstructions. In addition, such programs 
may help us look for vocabulary development over time and examine recency and latency effects 
in recalls. For example, are there consistencies in recalled items at the beginning, middle, and 
end of recalls? 


Summary and conclusion 

The automated recall protocol procedure has the potential to greatly enhance L2 reading 
comprehension assessment. The present study demonstrates that the procedure can be used to 
perform large-scale assessment, can lead to enhanced reading comprehension test development, 
and can improve diagnostic and placement testing. No longer will researchers need to rely 
strictly on discrete-point examinations that provide quantitative information but little else. 
Furthermore, unlike the multiple choice or cloze tests, the present procedure is a truly 
integrative, authentic -task firmly grounded on a reader-based construct of reading 
comprehension. In addition, compared to development of valid and reliable multiple-choice 
instruments, an often lengthy and expensive process (Maarof, 1998), development of a valid 
recall protocol assessment rubric is relatively easy. Using the present computer scoring 
procedure, quantitative score information is readily available and is directly linked to a 
qualitative database ripe for additional examination. 

Diagnostic and placement testing 

Further analysis of student-generated data enabled by the automated system will greatly enhance 
diagnostic testing as target language reading comprehension difficulties become evident, and 
thus will provide information to empower dynamic and active instruction. Such analyses can be 
streamlined through the use of word processors, concordancing programs, or statistical 
significant probability analysis. The automated recall protocol procedure also provides 
important data that can be used as the basis of comparison to make placement decisions. In sum, 
as an integrative measure, the automated recall protocol has the potential to become a powerful 
tool especially when used in a multiple measures assessment approach for research, and may 
help explain the unexplainable and unknown and further enhance L2 reading comprehension 
model development. 

Pedagogy 

Moreover, the automated procedure used in the present study has the potential to inform L2 
reading comprehension pedagogy. Understanding the state of a student's L2 reading 
comprehension ability can be an extremely important factor in deciding appropriate reading 
strategies to teach and develop. With this knowledge, teachers can guide and possibly accelerate 
students through development of enhanced L2 reading strategies and skills. Furthermore, once 
the scoring templates are developed, the automated recall protocol procedure provides ready 
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access to student-generated reading comprehension data. These data can help classroom teachers 
address comprehension problems students are having, not just those that someone thinks they 
might be having a priori. 

Teachers will no longer be reliant on expensive, development-intensive multiple-choice 
examinations that provide limited data. Nor will they need to rely on cloze examinations that are 
not truly integrative and thus fail to tap constructivist L2 reading comprehension constructs, as 
used here. For the first time, the L2 teacher can have a powerful assessment device that is easy 
to construct, is immediately available, uses authentic material that he or she chooses, and can be 
used to provide valid, reliable, and efficient reading comprehension assessments throughout the 
course. Used in a multiple-measures approach, the automated reading recall protocol can 
provide information that teachers can use to tailor instruction to student needs and thus promote 
active, dynamic, learner-centered instruction. Thus, ultimately, the use of this assessment 
procedure will not only help provide an accurate picture of the state of the student's development 
but may also help accelerate that development. Making the recall protocol procedure a usable 
reality through fully automated delivery and scoring will provide teachers with the ability to use 
the results (both quantitative and qualitative) for placement, diagnostic and classroom testing, 
and instruction. 
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Appendix: Recall protocol texts 

Batman — Aufstand im Kinderzimmer (Matussek, 1992: 193) 

"Er soil ja besser sein als der erste", sagt die 14jahrige Maria, die mit ihrer Ereundin vor dem 
Eoews-Kino am Broadway ansteht und nur millimeterweise vorriickt in der Schlange, die sich 
vor der Kasse gebildet hat. Obwohl sie an diesem Wochenende mithelfen wird, einen Rekord zu 
brechen, klingt sie nicht gerade begeistert. Es klingt wie: mitmachen und absitzen. Hier wird 
kein Eest angesteuert, sondern eine Hypnose. 

Eur Maria steht der Termin seit Wochen fest, auf einem Plakat, drei Stockwerke hoch iiber 
Times Square, schwarz auf gelb: "Batman kehrt zuriick". Der Eilm. 

Batman, die Geldmaschine, spuckt wieder. Bereits im ersten Anlauf vor zwei Jahren hatte der 
Mann mit der Eledermausmaske Platz sechs in der Eiste der besten Eilme aller Zeit geschafft. 
Nun spielte die Eortsetzung schon am ersten Wochenende 46,5 Millionen Dollar ein. 

Weltrekord. 

Alle amerikanische Kinder seit 1939 sind mit Batman groB geworden. Der Eledermaustyp mit 
der tragischen Kindheit ist ein schiichtemer einsamer Mensch, der sich verwandelt, wenn er sich 
die Maske uberstulpt. Batman, tagsuber braver Burger, ist der Eotse durch die Schattenwelt. 
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Bernhardt Letter (Bernhardt 1991: 124) 

Prof. Dr. E. Buchter-Bernhardt 
227 Arps Hall 
1945 N. High Street 
The Ohio State University 
Columbus, OH 43210 
USA 

Liebe Frau Buchter-Bernhardt, 

in der Anlage finden Sie die Dinge, die ich Ihnen in Newark versprochen habe. Wenn Sie an 
dem einen oder andem von uns interessiert sein sollten, konnen wir dies geme kopieren. 

Unnotig zu sagen, daB es groBen SpaB gemacht hat, Sie kennenzulernen, mit Ihnen zu plaudem 
und gemeinsame Interessen und Bekannte zu entdecken. 

Ob Sie so nett sein konnten, mir bei Gelegenheit den Namen und die Adresse Ihres Mitarbeiters, 
der jetzt in Virginia ist, mitzuteilen, damit ich auch ihm die versprochenen Materialien schicken 
kann. Ich vergaB, mir seine Adresse aufzuschreiben. 

Mit den besten GriiBen und alien guten Wunschen bin ich 

Ihr 


Auf Die Schnelle In Die Feme — Erst Packen Dann Buchen (Rodrian, 1990: 103) 

So schnell kann es gehen: Am Dienstag letzter Woche dachte Peter Frisch noch dariiber nach, ob 
er sich einen Trip nach Spanien leisten konnte. Am Donnerstag jettete er dann doch lieber nach 
San Franzisko. 895 Mark fiirs Ticket nach Kalifornien und zuriick — dieses Angebot hatte den 
Miinchner nicht lange Zogem lassen. MuB man vielleicht mit einer Stewardess verlobt sein, um 
so billig um die halbe Welt zu jetten? Des Ratsels Fosung ist viel einfacher: Als den Miinchner 
das Femweh iiberkam, hatte er sich bei den Fast-Minute-Buros umgehort. Bei der Tonband- 
Ansage von F'Tours wurde er fiindig. 

Noch rascher ging's beim Miinchner Studenten Manfred Kanzler: er packte einfach Zahnbiirste 
uns Scheckbuch ein und fuhr zum Flughafen. Da hatte er noch keine Ahnung, wohin die reise 
gehen sollte. Drei Stunden spater saB er schon im Jet nach Eliat am Roten Meer — fiir 498 Mark. 
Fiir Verkauferin Beate Baskos vom ABR- Fast-Minute-Service am Flughafen ist das nichts 
Ungewohnliches: Sie vermittelt jedes Wochenende Feriengliick gleich dutzendweise in letzter 
Minute. Der SchluB-Verkauf von Urlaubsreisen, vor drei Jahren noch fast unbekannt, erlebt jetzt 
den groBen Boom. 
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